
On rounding error resilience, maximal attainable accuracy and parallel performance of the pipelined Conjugate Gradients method for large-scale linear systems in PETSc

Author: Agullo Emmanuel
Cools Siegfried
Giraud Luc
Vanroose Wim
Yetkin Emrullah Fatih
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

Pipelined Krylov solvers typically display better strong scaling compared to standard Krylov methods for large linear systems. The synchronization bottleneck is mitigated by overlapping time-consuming global communications with computations. To achieve this hiding of communication, pipelined methods feature additional recurrence relations on auxiliary variables. This paper analyzes why rounding error effects have a significantly larger impact on the accuracy of pipelined algorithms. An algebraic model for the accumulation of rounding errors in the (pipelined) CG algorithm is derived. Furthermore, an automated residual replacement strategy is proposed to reduce the effect of rounding errors on the final solution. MPI parallel performance tests implemented in PETSc on an Intel Xeon X5660 cluster show that the pipelined CG method with automated residual replacement is more resilient to rounding errors while maintaining the efficient parallel performance obtained by pipelining.

On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection: Sur les soft-erreurs dans la méthode du Gradient Conjugué: sensibilité et détection numérique robuste

Author: Agullo Emmanuel
Cools Siegfried
Fatih-Yetkin Emrullah
Giraud Luc
Vanroose Wim
Publication venue: HAL CCSD
Publication date: 21/11/2018

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty years old, it is still a serious candidate for extreme-scale computation on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically affect their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset, which consists of a bit-flip in a memory cell producing unexpected results at the application level. Consequently, future computing facilities at extreme scale might be more prone to errors of any kind, including bit-flips during calculation. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such faults; we assess their robustness through extensive numerical experiments.

Analysis of rounding error accumulation in Conjugate Gradients to improve the maximal attainable accuracy of pipelined CG

Author: Agullo Emmanuel
Cools Siegfried
Giraud Luc
Vanroose Wim
Yetkin Emrullah Fatih
Publication venue: HAL CCSD
Publication date: 01/01/2016

Pipelined Krylov solvers typically offer better scalability in the strong scaling limit compared to standard Krylov methods. The synchronization bottleneck is mitigated by overlapping time-consuming global communications with useful computations in the algorithm. However, to achieve this communication hiding strategy, pipelined methods feature multiple recurrence relations on additional auxiliary variables to update the guess for the solution. This paper studies the influence of rounding errors on the convergence of the pipelined Conjugate Gradient method. It analyzes why rounding effects have a significantly larger impact on the maximal attainable accuracy of the pipelined CG algorithm compared to the traditional CG method. Furthermore, an algebraic model for the accumulation of rounding errors throughout the (pipelined) CG algorithm is derived. Based on this rounding error model, we then propose an automated residual replacement strategy to reduce the effect of rounding errors on the final iterative solution. The resulting pipelined CG method with automated residual replacement improves the maximal attainable accuracy of pipelined CG to a precision comparable to that of standard CG, while maintaining the efficient parallel performance of the pipelined method.
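The idea of residual replacement can be illustrated with a minimal sketch in plain Python. It is not the authors' method: it uses standard (non-pipelined) CG, and it replaces the recursively updated residual by the explicitly computed residual b - Ax at a fixed period, whereas the abstract describes an automated criterion based on a rounding error model. The function name `cg_residual_replacement` and the parameter `replace_every` are hypothetical.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [math.fsum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return math.fsum(a * b for a, b in zip(u, v))


def cg_residual_replacement(A, b, tol=1e-12, max_iter=200, replace_every=10):
    """Standard CG for symmetric positive definite A, starting from x = 0.

    Every `replace_every` iterations (a fixed period, assumed here for
    simplicity) the recursive residual is replaced by the true residual
    b - A x. Returns the approximate solution and the iteration count.
    """
    n = len(b)
    x = [0.0] * n
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0  # zero right-hand side: x = 0 is the exact solution
    r = list(b)  # r0 = b - A x0 with x0 = 0
    p = list(r)
    rr = dot(r, r)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        if k % replace_every == 0:
            # Residual replacement: recompute the true residual explicitly.
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        else:
            # Usual recursive residual update.
            r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, k
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter
```

Each replacement costs one extra matrix-vector product, which is why a criterion that triggers it only when needed is preferable to a fixed period.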

On soft errors in the conjugate gradient method: sensitivity and robust numerical detection

Author: Agullo Emmanuel
Cools Siegfried
Fatih-Yetkin Emrullah
Giraud Luc
Schenkels Nick
Vanroose Wim
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2020

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than 60 years old, it is still a serious candidate for extreme-scale computations on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset which consists in a bit-flip in a memory cell producing unexpected results at the application level. Consequently, future extreme-scale computing facilities will be more prone to errors of any kind, including bit-flips, during their calculations. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors and assess their robustness through extensive numerical experiments.
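To make the notion of a bit-flip concrete, the following sketch flips one bit of an IEEE 754 double-precision value using only the Python standard library. The helper name `flip_bit` and the bit numbering (0 for the least significant mantissa bit, 63 for the sign bit) are assumptions made for illustration; the abstract does not describe how the authors inject faults.

```python
import struct


def flip_bit(value, bit):
    """Return `value` (a float) with one bit of its 64-bit encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the requested bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped
```

The impact depends strongly on the position of the flipped bit: a flip in a low mantissa bit changes the value by a relative amount close to machine precision, while a flip in a high exponent bit can change it by many orders of magnitude or turn it into infinity.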

A complementary note on soft errors in the Conjugate Gradient method: the persistent error case

Author: Agullo Emmanuel
Cools Siegfried
Fatih-Yetkin Emrullah
Giraud Luc
Schenkels Nick
Vanroose Wim
Publication venue: HAL CCSD
Publication date: 25/08/2020

This note is a follow-up study to [1], where we studied the resilience of the preconditioned conjugate gradient method (PCG). We complement the original work by performing a similar series of numerical experiments, but using what we called persistent instead of transient bit-flips.

On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection -revised

Author: Agullo Emmanuel
Cools Siegfried
Fatih-Yetkin Emrullah
Giraud Luc
Schenkels Nick
Vanroose Wim
Publication venue: HAL CCSD
Publication date: 01/01/2020

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty years old, it is still a serious candidate for extreme-scale computation on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically affect their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset, which consists of a bit-flip in a memory cell producing unexpected results at the application level. Consequently, future computing facilities at extreme scale might be more prone to errors of any kind, including bit-flips during calculation. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors; we assess their robustness through extensive numerical experiments.

Soft Error in PCG: Sensitivity, Numerical Detections and Possible Recoveries

Author: Agullo Emmanuel
Giraud Luc
Yetkin Emrullah Fatih
Publication venue: HAL CCSD
Publication date: 27/02/2017


Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants

Author: Agullo Emmanuel
Giraud Luc
Yetkin Emrullah Fatih
Publication venue: HAL CCSD
Publication date: 14/03/2015

Soft errors that are not detected by hardware mechanisms may be extremely complex to detect at the software layer. One option is to perform a full duplication of the computation (and data) and check on a regular basis that intermediate results are consistent. However, the cost of this mechanism may be prohibitive. In the context of the CG solver, the most expensive operation to duplicate is the sparse matrix-vector product (SpMV). To avoid the duplication of this operation, checksum mechanisms may be employed. In this presentation, we investigate the reliability of such an approach in finite precision arithmetic. We illustrate our discussion with the CGPOP code, a miniapp for performing the CG within the Parallel Ocean Program (POP), which is a candidate for exascale climate simulations.
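A checksum test for SpMV can be sketched as follows, assuming the common column-sum variant: with c the vector of column sums of A, the identity sum(y) = c·x holds exactly for y = Ax, so a large mismatch signals a corrupted product. In finite precision the two sides differ by rounding, hence the need for a threshold. The function names, the sparse row format, and the tolerance `tol_factor` are hypothetical choices for this illustration, not taken from the presentation or from CGPOP.

```python
import math


def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (col, value) pairs."""
    return [math.fsum(v * x[j] for j, v in row) for row in rows]


def column_sums(rows, n):
    """Checksum vector c with c[j] = sum_i A[i][j]; computed once per matrix."""
    c = [0.0] * n
    for row in rows:
        for j, v in row:
            c[j] += v
    return c


def checksum_ok(rows, c, x, y, tol_factor=1e-12):
    """Check that sum(y) agrees with c . x up to a rounding-level threshold.

    The threshold scales with sum_ij |A_ij| |x_j|. It is recomputed here for
    clarity, which costs as much as the SpMV itself; a practical scheme would
    use a cheaper bound.
    """
    lhs = math.fsum(y)
    rhs = math.fsum(cj * xj for cj, xj in zip(c, x))
    scale = math.fsum(abs(v) * abs(x[j]) for row in rows for j, v in row)
    return abs(lhs - rhs) <= tol_factor * scale
```

The choice of threshold is the delicate part: too tight and ordinary rounding triggers false alarms, too loose and small corruptions of y go unnoticed. A checksum of this kind also cannot see errors whose contributions cancel in the sum.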

On Resiliency in Krylov Solvers

Author: Agullo Emmanuel
Giraud Luc
Salas Pablo
Yetkin Emrullah Fatih
Zounon Mawussi
Publication venue: HAL CCSD
Publication date: 01/06/2015

In this talk we will discuss possible numerical remedies to survive data loss in some numerical linear algebra solvers, namely Krylov subspace linear solvers and some widely used eigensolvers. Assuming that a separate mechanism ensures fault detection, we propose numerical algorithms to extract relevant information from available data after a fault. After data extraction, a well-chosen part of the missing data is regenerated through interpolation strategies to constitute meaningful inputs for a numerical restart. We will also present some preliminary investigations to address soft error detection, again at the application level, in the conjugate gradient framework.